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Abstract —We consider the problem of estimating self-exciting 
generalized linear models from limited binary observations, 
where the history of the process serves as the covariate. We 
analyze the performance of two classes of estimators, namely the 
I \-regularized maximum likelihood and greedy estimators, for 
a canonical self-exciting process and characterize the sampling 
tradeoffs required for stable recovery in the non-asymptotic 
regime. Our results extend those of compressed sensing for 
linear and generalized linear models with i.i.d. covariates to 
those with highly inter-dependent covariates. We further provide 
simulation studies as well as application to real spiking data from 
the mouse’s lateral geniculate nucleus and the ferret’s retinal 
ganglion cells which agree with our theoretical predictions. 

Index Terms —compressed sensing, generalized linear models, 
sparsity, spontaneous activity, neural signal processing. 


I. Introduction 

The theory of compressed sensing (CS) has provided a novel 
framework for measuring and estimating statistical models 
governed by sparse underlying parameters ID-®. In partic¬ 
ular, for linear models with random covariates and sparsity 
of the parameters, the CS theory provides sharp trade-offs 
between the number of measurement, sparsity, and estimation 
accuracy. Typical theoretical guarantees imply that when the 
number of measurements are roughly proportional to sparsity, 
then stable recovery of these sparse models is possible. 

Beyond those described by linear models, observations from 
binary phenomena form a large class of data in natural and 
social sciences. Their ubiquity in disciplines such as neuro¬ 
science, physiology, seismology, criminology, and finance has 
urged researchers to develop formal frameworks to model and 
analyze these data. In particular, the theory of point processes 
provides a statistical machinery for modeling and prediction 
of such phenomena. Traditionally, these models have been 
employed to predict the likelihood of self-exciting processes 
such as earthquake occurrences Q, j8j, but have recently 
found applications in several other areas. For instance, these 
models have been used to characterize heart-beat dynamics 
0, Col and violence among gangs CD. Self-exciting point 
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process models have also found significant applications in 
analysis of neuronal data lfT2ll - (T8l . 

In particular, point process models provide a principled way 
to regress binary spiking data with respect to extrinsic stimuli 
and neural covariates, and thereby forming predictive statisti¬ 
cal models for neural spiking activity. Examples include place 
cells in the hippocampus fl2l . spectro-temporally tuned cells 
in the primary auditory cortex ED, and spontaneous retinal 
or thalamic neurons spiking under tuned intrinsic frequencies 
El, ED- Self-exciting point processes have also been utilized 
in assessing the functional connectivity of neuronal ensembles 
ll22ft . 1 1231 . When fitted to neuronal data, these models exhibit 
three main features: first, the underlying parameters are nearly 
sparse or compressible f23l . (24); second, the covariates are 
often highly structured and correlated; and third, the input- 
output relation is highly nonlinear. Therefore, the theoretical 
guarantees of compressed sensing do not readily translate to 
prescriptions for point process estimation. 

Estimation of these models is typically carried out by 
Maximum Likelihood (ML) or regularized ML estimation in 
discrete time, where the process is viewed as a Generalized 
Linear Model (GLM). In order to adjust the regularization 
level, empirical methods such as cross-validation are typically 
employed ( 23| . In the signal processing and information theory 
literature, sparse signal recovery under Poisson statistics has 
been considered in E3 with application to the analysis of 
ranking data. In (26), a similar setting has been studied, with 
motivation from imaging by photon-counting devices. Finally, 
in theoretical statistics, high-dimensional M-estimators with 
decomposable regularizes, such as the t \ - norm, have been 
studied for GLMs ED- 

A key underlying assumption in the existing theoretical 
analysis of estimating GLMs is the independence and identical 
distribution (i.i.d.) of covariates. This assumption does not 
hold for self-exciting processes, since the history of the process 
takes the role of the covariates. Nevertheless, regularized ML 
estimators show remarkable performance in fitting GLMs to 
neuronal data with history dependence and highly non-i.i.d. 
covariates. In this paper, we close this gap by presenting new 
results on robust estimation of compressible GLMs, relaxing 
the common assumptions of i.i.d. covariates and exact sparsity. 

In particular, we will consider a canonical GLM and will an¬ 
alyze two classes of estimators for its underlying parameters: 
the 1 1 -regularized maximum likelihood and greedy estimators. 
We will present theoretical guarantees that extend those of 
CS theory and characterize fundamental trade-offs between 
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the number of measurements, model compressibility, and 
estimation error of GLMs in the non-asymptotic regime. Our 
results reveal that when the number of measurements scale 
sub-linearly with the product of the ambient dimension and a 
generalized measure of sparsity (modulo logarithmic factors), 
then stable recovery of the underlying models is possible, 
even though the covariates solely depend on the history of the 
process. We will further discuss the extensions of these results 
to more general classes of GLMs. Finally, we will present 
applications to simulated as well as real data from two classes 
of neurons exhibiting spontaneous activity, namely the mouse’s 
lateral geniculate nucleus and the ferret’s retinal ganglion cells, 
which agree with our theoretical predictions. Aside from their 
theoretical significance, our results are particularly important 
in light of the technological advances in neural prostheses, 
which require robust neuronal system identification based on 
compressed data acquisition. 

The rest of the paper is organized as follows: In Section 
mi we present our notational conventions, preliminaries and 
problem formulation. In Section [TTl] we discuss the estimation 
procedures and state the main theoretical results. Section [IV] 
provides numerical simulations as well as application to real 
data. In Section [VJ we discuss the implications of our results 
and outline future research directions. Finally, we present 
the proofs of the main theoretical results and give a brief 
background on relevant statistical tests in Appendices IaHdI 

II. Preliminaries and Problem Formulation 

We first give a brief introduction to self-exciting GLMs 
(see |28l for a detailed treatment). We will use the following 
notation throughout the paper. Parameter vectors are denoted 
by bold-face Greek letters. For example, 9 = [#i, 62 , ■ • • , 9 p ]' 
denotes a p-dimensional parameter vector, with [•]' denoting 
the transpose operator. We also use the notation x;- to represent 
the (j — i + 1)-dimensional vector [xi, Xi+i, ■ ■ ■ , Xj\' for 
any i.j £ Z with i < j. For a vector 9, we define its 
decomposition into positive and negative parts given by: 

9 = 9 + - 9 ~, 

where 9 ± = rnax{ ± 9 . 0}. It can be shown that 


are convex in 9 . 

We consider a sequence of observations in the form of 
binary spike trains obtained by discretizing continuous-time 
observations (e.g. electrophysiology recordings), using bins of 
length A. We assume that not more than one event fall into 
any given bin. In practice, this can always be achieved by 
choosing A small enough. The binary observation at bin i is 
denoted by x^. The observation sequence can be modeled as 
the outcome of conditionally independent Poisson or Bernoulli 
trials, with a spiking probability given by P(xj = 1) =: A 
where A i \jj i is the spiking probability at bin i given the history 
of the process Hi up to bin i. 

These models are widely-used in neural data analysis and 
are motivated by the continuous time point processes with 
history dependent conditional intensity functions (28). For 


instance, given the history of a continuous-time point process 
H t up to time t, a conditional intensity of X(t\H t ) = A 
corresponds to the homogeneous Poisson process. As another 
example, a conditional intensity of A(f \H t ) = /x + f_ oc 9{t — 
t)cIN(t) corresponds to a process known as the Hawkes 
process (29) with base-line rate // and history dependence 
kernel $(•). Under the assumption of the orderliness of a 
continuous-time point process, a discretized approximation to 
these processes can be obtained by binning the process by 
bins of length A, and defining the spiking probability by 
A i := A(?'A|fTiA)A + o(A). In this paper, we consider discrete 
random processes characterized by the spiking probability 
A,|// j , which are either inherently discrete or employed as an 
approximation to continuous-time point process models. 

Throughout the rest of the paper, we drop the dependence 
of \\Hi on Hi to simplify notation, denote it by A, and 
refer to it as spiking probability. Given the sequence of binary 
observed data x", the negative log-likelihood function under 
the Bernoulli statistics can be expressed as: 

1 ” 

£( 0 ) =-V'{xjlogAj + (1 - Xi) log(l - Ai)} . ( 1 ) 

n z ' 

i=l 

Another common likelihood model used in the analysis of 
neuronal spiking data corresponds to Poisson statistics m, 
for which the negative log-likelihood takes the following form: 

1 n 

£{ 9 ) :=- 'V' {xi log A i - A*} . ( 2 ) 

n z ' 

i=l 

Throughout the paper, we will focus on binary observations 
governed by Bernoulli statistics, whose negative log-likelihood 
is given in Eq. ([]}. In applications such as electrophysiology 
in which neuronal spiking activities are recorded at a high 
sampling rate, the binning size A is very small and the 
Bernoulli and Poisson statistics coincide. 

When the discrete process is viewed as an approximation 
to a continuous-time process, these log-likelihood functions 
are known as the Jacod log-likelihood approximations (28) - 
We will present our analysis for the negative log-likelihood 
given by 0. but our results can be extended to other statistics 
including (0 (See the remarks of Section Hill ). 

Throughout this paper x ” p+1 will be considered as the 
observed spiking sequence which will be used for estimation 
purposes. A popular class of models for Xi is given by 
GLMs. In its general form, a GLM consists of two main 
components: an observation model and an equation expressing 
some (possibly nonlinear) function of the observation mean as 
a linear combination of the covariates. In neural systems, the 
covariates consist of external stimuli as well as the history 
of the process. Inspired by spontaneous neuronal activity, we 
consider fully self-exciting processes, in which the covariates 
are only functions of the process history. As for a canonical 
GLM inspired by the Hawkes process, we consider a process 
for which the spiking probability is a linear function of the 
process history: 

A + (3) 

where fi is a positive constant representing the base-line rate, 
and 9 = [$i, 6 * 2 , • • •, 9 P ]' is a parameter vector denoting the 
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history dependence of the process. We further assume that 
the process is non-degenerate, i.e., it will not terminate in an 
infinite sequence of zeros. We refer to this GLM, viewed as a 
random process, as the canonical self-exciting process. Other 
popular models in the computational neuroscience literature 
include the log-link model where A * = exp(p + 0'~xfZz) 

and the logistic-link model where A * = l - xp h 1 + 6> <_p) _ qqjg 

l+exp(/i+0'x i _ p ) 

parameter vector 8 can be thought of as the binary equivalent 
of autoregressive (AR) parameters in linear AR models. 

When fitted to neuronal spiking data, the parameter vector 8 
exhibits a degree of sparsity ED, El, that is, only certain lags 
in the history have a significant contribution in determining the 
statistics of the process. These lags can be thought of as the 
preferred or intrinsic delays in the spontaneous response of a 
neuron. To be more precise, for a sparsity level s < p, we 
denote by 8 S the best s-term approximation to 8 . We also 
define 

o s (8):=\\8-8 s \\ u (4) 

which is a scalar function of 8 and s, and captures the 
compressibility of the parameter vector 8 in the l\ sense. For 
a fixed £ £ (0,1), we say that 8 is (s, ^-compressible if 
<T,( 0 ) = Ois 1 -?) 0. Note that when £ = 0, the parameter 
vector 8 is exactly s-sparse. 

Finally, in this paper, we are concerned with the compressed 
sensing regime where n <C p, i.e., the observed data has 
a much smaller length than the ambient dimension of the 
parameter vector. The main estimation problem of this paper 
is the following: given observations x™ p+1 from the canonical 
self-exciting process, the goal is to estimate the unknown 
baseline rate p and the p-dimensional (s. f)-compressible 
history dependence parameter vector 8 in a stable fashion 
(where the estimation error is controlled) when n -C p. 

III. Theoretical Results 

In this section, we consider two estimators for 8 , namely, 
the 1 1 -regularized ML estimator and a greedy estimator, and 
present the main theoretical results of this paper on the 
estimation error of these estimators. Note that when p is not 
known, the following results can be applied to the augmented 
parameter vector [p, 8 ']'. We analyze the case of known p for 
simplicity of presentation. 

A. (\-Regularized ML Estimation 

The natural estimator for the parameter vector is the ML 
estimator, which is widely used in neuroscience (23. which 
by virtue of 0 is given by: 

0ml = argmin£(0), (5) 

<9G® 

where © is the relaxed closed convex feasible region for which 
0 < Xi < 1 given by the conditions: 

0 < 7T min < p - 1'8~, 

p + 1 '8+ < ^ max < 1/2, 1 J 

for some constants 7T m i n and 7r max . This first inequality incurs 
minimal loss of generality, as 7r m i n can be chosen to be 


arbitrarily small. The restriction of 7r max <1/2 ensures that 
the process is fast mixing and has mainly been adopted for 
technical convenience. This assumption incurs some loss of 
generality, as it excludes processes for which the maximum 
spiking probability exceeds 1/2. However, due to the low 
spiking probability of typical neuronal activity, this loss is 
tolerable for the applications of interest in this paper (see 
Section IIVI ). 

In the regime of interest when n ■< p, the ML estimator 
is ill-posed and is typically regularized with a smooth norm. 
In order to capture the compressibility of the parameters, we 
consider the (\-regularized ML estimator: 

0 sp := argmin £(0) +7 n ||0||i- (6) 

0<E® 

where 7 n > 0 is a regularization parameter. It is easy to 
verify that the objective function and constraints in Eq. (0 
are convex in 8 and hence 8 sp can be obtained using standard 
numerical solvers. Note that the solution to 0 might not be 
unique. However, we will provide error bounds that hold for 
all possible solutions of 0. with high probability. 

It is known that ML estimates are asymptotically unbiased 
under mild conditions, and with p fixed, the solution converges 
to the true parameter vector as n —> oo. However, it is not clear 
how fast the convergence rate is for finite n or when p is not 
fixed and is allowed to scale with n. This makes the analysis of 
ML estimators, and in general regularized M-estimators, very 
challenging 127]. Nevertheless, such an analysis has significant 
practical implications, as it will reveal sufficient conditions 
on n with respect to p as well as a criterion to choose 7 n , 
which result in a stable estimation of 8 . Finally, note that we 
are fixing the ambient dimension p throughout the analysis. In 
practice, the history dependence is typically negligible beyond 
a certain lag and hence for a large enough p, GLMs fit the 
data very well. 

B. Greedy Estimation 

Although there exist fast solvers to convex problems of the 
type given by Eq. 0, these algorithms are polynomial time in 
n and p, and may not scale well with high-dimensional data. 
This motivates us to consider greedy solutions for the estima¬ 
tion of 8 . In particular, we will consider a generalization of 
the Orthogonal Matching Pursuit (OMP) lf30l . fJTl for general 
convex cost functions. A flowchart of this algorithm is given 
in Table U which we denote by the Point Process Orthogonal 
Matching Pursuit (POMP) algorithm. At each iteration, the 
component in which the objective function has the largest 
deviation is chosen and added to the current support. The 
algorithm proceeds for a total of s* steps, resulting in an 
estimate with s* components. 

The main idea behind the generalized OMP is in the greedy 
selection stage, where the absolute value of the gradient of 
the cost function at the current solution is considered as the 
selection metric. Consider an estimate at the (k— l)-st 

stage of the generalized OMP for a quadratic cost function of 
the form ||b— A0|||, with b and A denoting the observation 
vector and covariates matrix, respectively. Then, the gradient 
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Input: £(0),s* 

Output: 0poi\/ip 

r Start with the index set S = 0 
Initialization: <! . w 0 ) 

l and the initial estimate #po M p — 0 

for k = 1, 2, • • •, s* 

J = ar s max I (V£ (flpoMP ) ) J 
s( fc ) = S( fc - 1 ) U{j} 

®POMP = ar s min £ ( 0 ) 
su pp(e)cs( fe ) 

end 


TABLE I: Point Process Orthogonal Matching Pursuit (POMP) 


takes the form A'(b — A9 t ' k ~ 1 ' > ) which is exactly the correla¬ 
tion vector between the residual error and the columns of A 
as in the original OMP algorithm. 


C. Theoretical Guarantees 

Recall that the parameter vector 9 £ R p is assumed to be 
(s,^-compressible, so that a s (9) = ||0 — 0s||i= 0(s 1 ~(), 
and the observed data are given by the vector x ” p+1 £ 
{ 0 , 1 } n +J' _ l , all in the regime of s . n <C p. In the remainder of 
this paper, we assume that 9 £ ©. The main theoretical result 
regarding the performance of the I \-regularized ML estimator 
is given by the following theorem: 


Theorem 1. If a s (9) = 0{yfs), there exist constants 

di,d 2 ,d 3 and d^ such that for n > d\S 2 ^ 3 p 2 ^ 3 logp and a 
choice of y n = d 2 any solution 9 sp to (( 6 ]) satisfies the 

bound 


9 sp -9 


V n V n 

with probability greater than 1 — 0 (—/j-). 

Similarly, the following theorem characterizes the perfor¬ 
mance bounds for the POMP estimate: 


Theorem 2. If 9 is (s, ^-compressible for some < 1 / 2 , 
there exist constants dfi , df ■ df and d'± such that for n > 
d , 1 s 2 / 3 p 2 / 3 (log s ) 2 / 3 logp, the POMP estimate satisfies the 
bound 




#POMP — 9 

2 < ^ 


s logs log p 
n 


j lo § s 

- t 3 1 9 

si 2 


( 8 ) 


after s * = (U(slogs) iterations with probability greater than 

Full proofs of Theorems Q] and [2] are given in Appendix [A] 
Remarks. An immediate comparison of the sufficient condition 
n = 0(s 2 / 3 p 2 / 3 logp) of Theorem [T] with those of | |27l for 
GLM models with i.i.d. covariates given by n = 0(s log p) 
reveals that a loss of order 0 (p 2 / 3 s -1 / 3 ) is incurred due to 
the inter-dependence of the covariates. However, the sample 
space of n i.i.d. covariates is np-dimensional, whereas in our 
problem the sample space is only (n+p) -dimensional. Hence, 
the aforementioned loss can be viewed as the price of self¬ 
averaging of the process accounting for the low-dimensional 
nature of the covariate sample space. To the best of our 


knowledge, the dominant loss of 0(p 2 / 3 ) in both theorems 
does not seem to be significantly improvable, as self-exciting 
processes are known to converge quite slowly to their ergodic 
state |[32ll . On a related note, the analysis of the sampling 
requirements of linear AR models reveals a loss of 0(p 1 / 2 ) 
in the number of measurements El- 

The sufficient condition of Theorem U given by n = 
0(s 2 / 3 p 2 t 3 (logs ) 2 / 3 logp) implies an extra loss of (logs ) 2 / 3 
due to the greedy nature of the solution. Theorem [2] also 
requires a high compressibility level of the parameter vector 
9 (£ < 1/2), whereas Theorem [T| does not impose any extra 
restrictions on £ £ (0,1). Intuitively speaking, this compar¬ 
ison reveals the trade-off between computational complexity 
and compressibility requirements for convex optimization vs. 
greedy techniques, which is well-known for linear models @. 

The constants d„i. d ' t , i = 1, - • -, 4, a and /3 are explicitly 
given in the proof of the theorems in Appendix [A] As for a 
typical numerical example, for 7 T m i n = 0.01 and 7 r max = 0.49, 
the constants of Theorem Q] can be chosen as d\ ~ 10 3 , d 2 = 
50, g ?3 ~ 10 4 and c4 = 4. We will next give a sketch of the 
proof of these theorems. 

Proof Sketches of Theorems 1 and 2. The main ingredient in 
the proofs of Theorems |T| and [ 2 ] is inspired by the beautiful 
treatment of Negahban et al. in [;27j in establishing the notion 
of Restricted Strong Convexity (RSC). By the convexity of 
the negative Jacod log-likelihood given by Eq. ([]}, it is 
clear that a small change in 9 results in a small change in 
the negative Jacod log-likelihood. However, the converse is 
not necessarily true. Intuitively speaking, the RSC condition 
guarantees that the converse holds: a small change in the log- 
likelihood implies a small change in the parameter vector, i.e., 
the log-likelihood is not too flat around the true parameter 
vector. A depiction of the RSC condition for p = 2, adopted 
from ll27l . is given in Figure [T] In Figure [Ha), the RSC does 
not hold since a change along 0 2 does not change the log- 
likelihood, whereas the log-likelihood in Figure [Qb) satisfies 
the RSC. 

(b) 

0i 

Fig. 1: Illustration of RSC (a) RSC does not hold (b) RSC does 
hold. 

More formally, if the log-likelihood is twice differentiable 
at 9, the RSC is equivalent to existence of a lower quadratic 
bound on the negative log-likelihood: 

S £ (r/>, 9) := £(0 + fit) - 2(9) - fit'S/2(9) > K\\fit\\l (9) 

for a positive constant k > 0 and all fit £ W in a carefully- 
chosen neighborhood of 9 depending on s and £. Based 
on the results of 123 and ESI. when the RSC is satisfied, 
sufficient conditions akin to those in Theorems G] and [ 2 ] can 
be obtained by estimating the Euclidean extent of the solution 
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set around the true parameter vector (see Propositions 0 and 
[4] in Appendix lAl). 

The major technical challenge for the canonical self-exciting 
process, as opposed to the GLM models with i.i.d. covariates 
in E3, lies in the fact that the covariates are highly inter¬ 
dependent as they are formed by the history of the process. 
Hence, it is not straightforward to establish RSC with high 
probability, as the large deviation techniques used for i.i.d. 
random vectors do not hold. We establish the RSC for the 
canonical self-exciting process in two steps (see Lemma Q] 
in Appendix [A}. First, we show that RSC holds for the 
expected value of the negative log-likelihood E[£(0)], and 
then by invoking results on concentration of dependent random 
variables show that the negative log-likelihood £(9) resides 
in a sufficiently small neighborhood of E[£(0)] with high 
probability, and hence satisfies the RSC. 

The remainder of the proof of Theorem [T] establishes that 
upon satisfying the RSC, the estimation error can be suitably 
bounded (Proposition^ Appendix lAb. Similarly, Theorem 2 is 
proven using the RSC together with the results adopted from 
Eol on the performance of OMP for convex cost functions 
(Proposition [4] Appendix [All. 

Extensions. For simplicity and clarity of presentation, we have 
opted to present the proofs for the case of known // and for 
the canonical self-exciting process. The following corollary 
extends our results to the case of unknown //: 

Corollary 1. The claims of Theorems [7] and \2\ hold when p, 
is not known , except for possibly slightly different constants. 

Proof The proof is given in Appendix [A] □ 

The canonical self-exciting process can be generalized to 
a larger class of GLMs by generalization of its spiking 
probability function. In a more general form we can consider 
a spiking probability function given by 

\i=4>(p + 0'xilp) , 

where (/>(■) is a possibly nonlinear function for which 0 < 
A i < 1. In their continuous form, such processes are re¬ 
ferred to as the nonlinear Hawkes process lf34ll . Two of the 
commonly-used models in neural data analysis are the log- 
link and logistic-link models. Our prior numerical studies in 
E3 revealed a similar performance improvement of the t\- 
regularized ML and the greedy solution over the ML estimate 
for the log-link model. Stationarity of these discrete processes 
can be proved similar to the canonical self-exciting process 
(see Appendix[B|). The latter fact is key to extending our proofs 
to other models and is summarized by the following corollary: 

Corollary 2. Theorems [7] and [2] hold when the spiking prob¬ 
ability is given by Xi = <f (p + 9'xff 1 ^ for some continuous, 
bounded, convex and twice-differentiable function </>(•) (e.g., 
<p(x) = exp(x) or </>(x) = logit ~ 1 (x)) for which 0 < A, < 
1 /2, except for different constants. 

Proof. The proof is given in Appendix [A] □ 

IV. Application to Simulated and Real Data 

In this section, we study the performance of the conven¬ 
tional ML estimator, the t \-regularized ML estimator, and the 


POMP estimator on simulated data as well as real spiking data 
recorded from the mouse’s lateral geniculate nucleus (LGN) 
neurons and the ferret’s retinal ganglion cells (RGC). We have 
archived a MATLAB implementation of the estimators used 
in this paper using the CVX package lf36l on the open source 
repository GitHub and made it publicly available 1 1371 1. 

A. Simulation Studies 

In order to simulate spiking data governed by the canonical 
self-exciting process, we sequentially generate spikes using 
El- We have used p = 0 . 1 , 7 r m i n = 0 . 01 , 7 r max = 0.49, 
p = 1000, s = 3 and n = 950 for simulation purposes. Figure 
[2] shows 500 samples of the canonical self-exciting process 
generated using a history dependence parameter vector shown 
in Figure [2a). The parameter vector 9 is compressible with 
a sparsity level of s = 3 and 0 - 3 ( 6 ) = 0.05. A value of 
= 0.1 is used to obtain the i \-regularized ML estimate, 
which is slightly tuned around the theoretical estimate given 
by Theorem Q] Figures [2b), E c )> and Ed) show the esti¬ 
mated history dependence parameter vectors using ML, i\- 
regularized ML, and POMP, respectively. It can be readily 
visually observed that regularized ML and POMP significantly 
outperform the ML estimate in finding the correct values of 
9. More specifically, the components at lags 405 and 800 
(indicated by the gray arrows) are underestimated by the ML 
estimator, and their contribution is distributed among several 
falsely identified smaller lag components. 
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Fig. 2: A sample of the simulated canonical self-exciting process. 

In order to quantify this performance gain, we repeated 
the same experiment by generating realizations corresponding 
to randomly chosen supports of size s = 3 for 9 and 
spike trains of length 10 2 < n < 10 6 . In each case, the 
magnitudes of the components of 9 were chosen to satisfy 
the assumptions ©• For a given 9, the mean-square-error 
(MSE) of the estimate 9 is defined as E{||0 — 0|| 2 }, where 
E{-} is the sample average over the realizations of the process. 
Figure [2 shows the results of this simulation, where a similar 
systematic performance gain is observed. The left segment of 
the plot (shaded in yellow) and the right segment correspond 
to the compressive (n < p) and denoising (n > p) regimes, 
respectively. Error bars on the plot indicate 90% quantiles of 
the MSE for this simulation obtained by multiple realizations. 
As it can be inferred from Figure [2 the l \-regularized ML 
and POMP have a systematic performance gain over the ML 
estimate in the compressive regime, where n -C p, with the 
former outperforming the rest. In the denoising regime, the 
performance of the i \-regularized and ML become closer, 
while the POMP saturates to a higher MSE floor. The latter 
observation can be explained by the fact that the POMP can 
only estimate s* components (including those of 9 S ), and fails 
to capture the (p— s*) compressible components. This results 
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(b) ML 


0.15 


<g^ 0 
- 0.15 

1 500 1000 

(c) t\ -regularized ML 


0.15 


a 

- 0.15 

1 500 1000 

(d) POMP 

Fig. 3: (a) True parameters vs. (b) ML, (c) l \-regularized ML, and 
(d) POMP estimates. 





n 

Fig. 4: MSE performance of the ML, £i -regularized ML and POMP 
estimators. 

in an MSE floor above that obtained by ML, for large values 
of n. 

The MSE comparison in Figure 0] requires one to know 
the true parameters. In practice, the true parameters are un¬ 
known, and statistical tests are typically used to assess the 
goodness-of-fit of the estimates to the observed data. We use 
the Kolmogorov-Smirnov (KS) test and the autocorrelation 
function (ACF) test to assess the goodness-of-fit. These tests 
are based on the time-rescaling theorem for point processes 
1381 , which states that if the time axis is rescaled using the 
estimated conditional intensity function of the inhomogeneous 
Poisson process, the resulting point process is a homogeneous 
Poisson process with unit rate. Thereby, one can test for the 
validity of the time-rescaling theorem via two statistical tests: 
the KS test reveals how close the empirical quantiles of the 



(c) POMP 

Fig. 5: KS and ACF tests at 95% confidence level, for the ML, 
i \-regularized ML and POMP estimates. 


time-rescaled point process to the true quantiles of a unit rate 
Poisson process, and the ACF test reveals how close the ISI 
distribution of the time-rescaled process is to the true ISI 
distribution of a unit rate Poisson process. Details of these tests 
are given in Appendix [D] Figure [5] shows the KS and ACF 
tests at a 95% confidence level for the ML t \-regularized ML, 
and the POMP estimates from Figure 0 The yellow shades 
mark the regions below the specified confidence levels. The 
ML estimate fails to pass the KS test, while the regularized 
and POMP estimates pass both tests. 


B. Application to Spontaneous Neuronal Spiking Activity 

1) Background and motivation: Early studies of sponta¬ 
neous neuronal activity from the cat’s cochlear nucleus ll39l 
marked a significant breakthrough in computational neuro¬ 
science by going beyond the so-called Poisson hypothesis, by 
which single neurons were assumed to be firing according 
to homogeneous Poisson statistics. The diversity of the ISIs 
deduced from the spontaneous activity of the cochlear neurons 
led to the development of more sophisticated statistical models 
based on renewal process theory, resulting in the Gamma 
and inverse Gaussian ISI descriptions of spontaneous neuronal 
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activity El, HD. Due to the analytical difficulties involved in 
working with these models, their generalization to a broader 
range of spiking statistics is not straightforward. 

In light of the more recent discoveries on the role of 
spontaneous neuronal activity in brain development ia, si, 
its relation to functional architecture El, and its functional 
significance in a variety of modalities including retinal l42l . 
visual El, auditory 8461 . hippocampal El, cerebellar El, 
and thalamic ||49l function, the modeling and analysis of this 
phenomenon has sparked a renewed interest among researchers 
in recent years. In particular, models based on GLMs have 
shown to overcome the analytical difficulties of the above- 
mentioned models based on renewal theory, and have been 
successfuly used in relating the spontaneous neuronal activity 
to instrinsic and extrinsic neural covariates El, El, El, 
m as well as inferring the functional connectivity of neuronal 
ensembles lf22l . 1231 . The above-mentioned results rely on the 
accuracy of the ML estimation of these models. In addition, 
the estimated parameters are typically sparse. Therefore, the 
£i -regularized ML and POMP estimators are expected to offer 
a more robust alternative than the ML, especially under the 
limited observation setting. 

In order to evaluate the performance of these estimators on 
real data, in the remainder of this section we will compare the 
performance of the ML, t \-regularized ML, and POMP esti¬ 
mators in modeling the spontaneous spiking activity recorded 
from two different types of neurons, namely the mouse’s 
lateral geniculate nucleus and the ferret’s retinal ganglion cells. 

In the following analysis, the regularization parameter 
7 „ was chosen using a two-fold cross-validation refinement 
around the value obtained from our theoretical results. The 
length of the history components p was chosen by first 
selecting a large enough p as an upper bound for the expected 
correlation length of neuronal spontaneous activity (estimated 
as ~ 1.5 s), followed by reducing p to the point where an 
increase in the history length does not result in significantly 
detected history components. 

2) Application to LGN spiking activity: We first compare 
the performance of the estimators on the LGN neurons. The 
LGN is part of the thalamus in the brain, which acts as a relay 
from the retina to the primary visual cortex iBTI . The data 
were recorded at 1ms resolution from the mouse LGN neurons 
using single-unit recording (52l . We used about 5 seconds of 
data from one neuron for the analysis. In order to capture the 
history dependence governing the spontaneous spiking activity 
of the LGN neuron, we model the spiking probability using 
the canonical self-exciting process model with p = 100 (A = 
lms). Figure [6] shows the spiking data used in the analysis. 



i 

Fig. 6: The LGN spiking data used in the analysis. 


Figure [7] shows the estimated history dependence parameter 
vectors using the three methods. Both the regularized ML (Fig¬ 
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(c) POMP 

Fig. 7: (a) ML, (b) l \-regularized ML, and (c) POMP estimates of 
the LGN spiking parameters. 


ure[7]'b)) and POMP (Figure 0c)) estimates capture significant 
history dependence components around a lag of 90-95 ms 
(marked by the upward arrows). In l20l . an intrinsic neuronal 
oscillation frequency of around 10 Hz has been reported in 
around 30% of all classes of mouse retinal cells under exper¬ 
iment, using combined two-photon imaging and patch-clamp 
recording. Our results are indeed consistent with the above 
mentioned findings about the intrinsic spiking frequency of 
retinal neurons. To see this, we consider the power spectral 
density of the canonical self-exciting process given by: 


sv) = 


27r 



(1 — I'd) 2 |1 — 0(w)| 2 


( 10 ) 


where 0(w) is the discrete-time Fourier transform of 6 
and 7T* = //,/(l — I'd) denotes the stationary distribution 
probability of spiking. The derivation of the power spectral 
density is given in Appendix [B] The power spectral density 
of the canonical self-exciting process resembles the Bartlett 
spectrum of the Hawkes process El, El, El, whose 
peaks correspond to the significant oscillatory components of 
the underlying process. Our estimated parameter vectors 6 
using the regularized ML and POMP have significant nonzero 
components around lags of 90 < k < 95. As a result, S(oj) 
peaks at ui = Hence, f = is an estimate of the 
significant intrinsic frequency of the underlying self-exciting 
process. Using the estimated numerical values, the intrinsic 
frequency is around 10.5-11 Hz, which is consistent with 
experimental findings of |20l . Compared to the method in l20l . 
our estimates are obtained using much shorter recordings of 
spiking activity and provide a principled framework to study 
the oscillatory behavior of LGN neurons using sparse GLM 
estimation. 
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Fig. 9: Segment of the RGC spiking data used in the analysis. 

electrode array from the ferret retina at 50 ps tsa. We used 
5 seconds of data from one neuron for the analysis (neuron 
2, session 1 , adult data set, CARMEN data base ED)- Figure 
[9] shows a segment of the spiking data used in our analysis. 
The RGC activity in the adult ferret is characterized by bursts 
of activity with a mean firing rate of 9 ± 7 Hz, which are 
separated by 0.5-1 s intervals lf55fl . 

In order to capture the history dependence governing the 
spontaneous spiking activity of the RGC neuron, we model the 
spiking probability using two different link models to further 
corroborate the generalization of our results to models beyond 
the canonical self-exciting process studied in this paper. First, 
we consider the canonical self-exciting process model. We 
have chosen 7r max = 0.49, p = 50 (A = 25 ms) and s* = 3. 
The baseline parameter //. is estimated from the data and is set 
to be equal to half of empirical mean firing rate of the neuron. 

Figure[[0] shows the estimated history components using the 
three estimators. All three estimates capture significant self¬ 
exciting history dependence components around the lags of 
150 ms and 0.65-0.75 s (marked by the upward arrows). In¬ 
voking the foregoing argument for the LGN neuron regarding 
the power spectral density of the process dTOl i. these estimated 
lag components are consistent with the empirical estimates of 
El, as they indicate that the data can be characterized by a 
combination of 150 1 ms = 6.66 Hz bursts separated by gaps of 


Fig. 8: KS and ACF tests at 99% confidence level, for the ML, 
i \-regularized ML and POMP estimates. 

Note that there is a difference in the orders of magnitudes 
of the POMP estimate compared to the ML and regularized 
ML estimates. This is due to the fact that the POMP estimate 
is exactly s-sparse, whereas the ML and regularized ML 
estimates consist of p — 100 non-zero values. In order to 
assess the goodness-of-fit of these estimates, we invoke the KS 
and ACF tests. Figure [8] shows the corresponding KS and ACF 
test plots. As it is implied from Figure da), the ML estimate 
fails both tests due to overfitting, whereas the regularized ML 
(Figure [8jb)) passes both tests at the specified confidence 
levels. The POMP estimate (Figure |]c)), however, passes 
the KS test while marginally failing the ACF test. The latter 
observation implies that the seemingly negligible components 
of the parameter vector captured by the regularized ML 
estimate seem to be important in explaining the statistics of 
the observed data. 

3) Application to RGC spiking activity: We will next study 
the performance of the estimators on spiking data recorded 
from the RGCs of neonatal and adult ferrets E3. The retinal 
ganglion cells are located in the innermost layer of the retina. 
They integrate information from photoreceptors and project 
them into the brain |56). The data were recorded using a multi- 




(b) i \-regularized ML 



Fig. 10: (a) ML, (b) G-regularized ML, and (c) POMP estimates of 
the RGC spiking parameters using the canonical self-exciting process 
model. 
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(a) ML 



(b) l \-regularized ML 



(c) POMP 

Fig. 11: KS and ACF tests at 95% confidence level, for the ML, h- 
regularized ML and POMP estimates using the canonical self-exciting 
process model. 


length 0.65-0.75 s. The ML estimator predicts an extra self- 
inhibitory (negative) component, which results in over-fitting 
the data. This phenomenon can be observed by noting that the 
ML estimate fails the KS test shown in Figure fill 

We will next consider a logistic link model of the form 
A t = cxp ^ +e ^--p) with (j = i00. This model is 

C+exp(/j+0'x i _. p ) 

widely used in neuronal modeling literature (e.g., (H, El), 
where the assumptions given by ® are dropped and the 
optimization is performed in an unconstrained fashion. We 
adopt this approach and obtain all the estimates by dropping 
the assumptions of (*). Figure fl2l shows the estimated history 
components using the unconstrained estimators. Compared to 
the canonical self-exciting process model with a linear link, 
both the regularized ML (Figure fTSlbl) and POMP (Figure 
fl2?c)l estimates capture similar significant self-exciting history 
dependence components, which are consistent across the two 
sets of estimates. 

The KS and ACF test results for this case are very similar 
to Figure QT| are are thus omitted for brevity. In order to 
further inspect the goodness-of-fit of these methods, we plot 
the estimated spiking probabilities in Figure [Q] The ML 
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(b) i \-regularized ML 



(c) POMP 

Fig. 12: (a) ML, (b) i \-regularized ML, and (c) POMP estimates of 
the RGC spiking parameters using the logistic link model. 



(b) l\ -regularized ML 



(c) POMP 

Fig. 13: (a) ML, (b) l \-regularized ML, and (c) POMP estimates 
of the RGC spiking probability using the unconstrained logistic link 
model. Blue vertical lines show the locations of the spikes, and red 
traces show the estimated probabilities. 


estimate shown in Figure fl3l a) overfits the spiking events by 
rapidly saturating the rate to either 0 and 1, which results in 
undesired high rate estimates where there are no spikes. On 
the contrary, the regularized ML (Figure IT3lb)) and POMP 
(Figure 1 1 3 I 'd') provide a more reliable estimate of the rates 
consistent with the spiking events. This analysis suggests that 
the sufficient assumptions of (*) are not necessary for the 
superior performance of the regularized and POMP estimators 
over that of ML. 
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V. Conclusion and Future Work 

In this paper, we studied the sampling properties of t^- 
regularized ML and greedy estimators for a canonical self¬ 
exciting process. The main theorems provide non-asymptotic 
sampling bounds on the number of measurements, which lead 
to stable recovery of the parameters of the process. To the best 
of our knowledge, our results are the first of this kind, and can 
be readily generalized to various other classes of self-exciting 
GLMs, such as processes with logarithmic or logistic links. 

Compared to the existing literature, our results bring about 
two major contributions. First, we provide a theoretical un¬ 
derpinning for the advantage of l \-regularization in ML es¬ 
timation as well as greedy estimation in problems involv¬ 
ing binary observations. These methods have been used in 
neuroscience in an ad-hoc fashion. Our results establish the 
utility of these techniques by characterizing the underlying 
sampling trade-offs. Second, our analysis relaxes the widely- 
assumed hypotheses of i.i.d. covariates. This assumption is 
often violated when working with history-dependent data such 
as neural spiking data. 

We also verified the validity of our theoretical results 
through simulation studies as well as application to real 
neuronal spiking data from mouse’s LGN and ferret’s RGC 
neurons. These results show that both the regularized ML 
and the greedy estimates significantly outperform the widely- 
used ML estimate. In particular, through making a connection 
with the spectrum of discrete point processes, we were able 
to quantify the estimation of the intrinsic firing frequency of 
LGN neurons. In the spirit of easing reproducibility, we have 
archived a MATLAB implementation of the estimators studied 
in this work using the CVX package j3§] on the open source 
repository GitHub and made it publicly available f37l . 

One of the limitations of our analysis is the assumption that 
the spiking probabilities are bounded by 1/2, which results 
in loss of generality. This assumption is made for the sake 
of theoretical analysis in bounding the mixing rate of the 
canonical self-exciting process. Our numerical experiments 
suggest that it is not necessary for the operation of the 
£i -regularized and POMP estimators. We consider further 
inspection of the mixing properties of this process and thus 
relaxing this assumption as future work. Our future work also 
includes generalization of our analysis to multivariate GLMs, 
which will allow to infer network properties from multi-unit 
recordings of neuronal ensembles. 
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Appendix A 

Proofs of Main Theorems 

A. Roadmap of the Proofs 

This appendix contains the proofs of Theorems HI and |2] 
as well as Corollaries Q] and [2] Before presenting the proofs, 
we establish some of the basic properties of the canonical 
self-exciting process (Proposition []} as well as our notational 
conventions as preliminaries. We then state a key result, 


namely Lemma [I] which is at the core of the proofs of 
Theorems [T| and [2] The proofs are presented via a sequence of 
three propositions (Propositions [30 based on existing results 
in the literature, in conjunction with Proposition |T] and Lemma 
[3 Therefore, Appendix [A] is stand-alone modulo the proofs of 
Proposition Q] and Lemma 0] 

The proofs of Proposition Q] and Lemma [3 are presented in 
Appendix [B] In particular, the proof of Lemma (3 follows from 
two propositions (Propositions 0 and [3- Therefore, Appendix 
iBl is stand-alone modulo the proofs of Propositions [3 and [3 

While Proposition [3 is a well-known result, Proposition [3 
requires a careful proof, which is presented in Appendixlcland 
relies on an existing result on the concentration of dependent 
random variables (Proposition 0. 


B. Preliminaries and Notation 

We state some useful properties of the canonical self¬ 
exciting process in the form of the following proposition: 


Proposition 1. [Properties of the Canonical Self-Exciting 
Process] The canonical self-exciting process is stationary and 
we have 


= 1 ^ ve > 0, p> Q=> I'd < 1, n + 


1 ' 9 + < 1 , 


SM = ^ + 


(i-i'0) 2 |i-e(w)| 2 y ’ 


Q , s ^ 7r*(! - TT*) _. 

‘ " - 2 tt (1 + 2^ max ) 4 ' 


where 7r* denotes the stationary probability of spiking, S(uj) 
denotes the power spectral density of the process, and 6 ± = 
max{±0, 0}. 


Proof. The proof is given in Appendix IBl 


□ 


The stationarity gap of 1 — I'd plays an important role 
in controlling the convergence rate of the process to its 
stationary distribution. Throughout the proof, we will also use 
the notation S p (t) := {u | ||iz|| p = t) to denote the p-norm 
ball of radius t. For simplicity of notation, we also define the 
n-sample empirical expectation as follows: 

1 " 

E n{f{x ■)} := -£/(*0 

i=1 

for any measurable function f(x.). Note that the subscript x. 
refers to an index in the set {1, 2, • • •, n}. 


C. Establishing the Restricted Strong Convexity 

The proof of Theorems Q] and [3 relies on establishing 
the Restricted Strong Convexity (RSC) for the negative log- 
likelihood given by 0. Recall that if the log-likelihood is 
twice differentiable with respect to 0, the RSC property 
implies the existence of a lower quadratic bound on the 
negative log-likelihood: 

2W, 0) := £(0 + VO - £(0) - <//V£(0) > k\MI (11) 

for a positive constant n > 0 and all satisfying: 

||V , s c lli< 3||t/>s||i+4||0sc||i. 


( 12 ) 
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for any index set S' C {1, 2, • • • ,p} of cardinality s. The latter 
condition is known as the cone constraint. 

The following key lemma establishes the Restricted Strong 
Convexity condition for the canonical self-exciting process: 

Lemma 1 (Restricted strong convexity of the canonical self- 
-exciting process). Let x” p+1 denote a sequence of sam¬ 
ples from the canonical self-exciting process with parameters 
{p, 9} satisfying the conditions given by (0). Then, for n > 
dis 2 / 3 p 2 / 3 logp, the negative log-likelihood function £(0) 
satisfies the RSC property with a positive constant k > 0 with 
probability at least 1 — 2 exp ^, for some constant c, 

and both n and c are only functions of d±, Ci, and 7r max . 

Proof. The proof is given in Appendix [B] □ 

Lemma Q] can be viewed as the key result in the proofs of 
Theorems Q] and [2] which follow next. 


D. Proof of Theorem [7] 

We first restate the main result of fl27ll concerning RSC and 
its implications in controlling the estimation error for GLMs: 

Proposition 2. For a negative log-likeliliood £(0) which 
satisfies the RSC with parameter n, every solution to the 
convex optimization problem © satisfies 


dsp-e 


< 


/ 2 r ) n <J s (9) 


with a choice of the regularization parameter 

7n > 2 ||V£(0)|| . 


(13) 


(14) 


Proof. The proof is a special case of Theorem 1 of lt27ll . □ 


The first term in the bound (ITdll is increasing in s and 
corresponds to the estimation error of the s largest components 
of 9 in magnitude, whereas the second term is decreasing in 
s and represents the cost of replacing 9 with its best s-sparse 
approximation. 

Given the results of Lemma |T| and Proposition [2j it only 
remains to establish an upper bound on j n . To this end, we 
establish a suitable upper bound on ||V£(0)|| OO which holds 
with high probability and provides the appropriate scaling of 
7 n . From Eq. ©, we have 

1 ” x i_1 

(15) 

i =1 v 7 

We proceed in two steps: 

Step 1. We first show that 

(16) 


To see this, we use the law of iterated expectations on the ith 
term as follows: 


E 


[xi - (p + e'xji)] 


= E 


= E 


E 


E 


pJ> Aj(l — Xi) 


Xi -(p + G^zl)^=^ 


l ~ p Aj(l — Xi) 


l—p 


Xi~(p + 0'x*_ p ) 


J-l 


Zi Aj(l-Ai) 


Summing over i, establishes ©. 


= 0 (17) 


Step 2. We next show that the summation given by (IT5l) 
is concentrated around its mean. The iterated expectation 
argument used in establishing ( fI7t implies that the sequence 

is a martingale with respect to the filtration given by Ti = 
er(xi_ +:L ), where cr(-) denote the sigma-field generated by 
the random variables in its argument. We will now state 
the following concentration result for sums of bounded and 
dependent random variables |59l : 


Proposition 3. Fix n > 1. Let Zi ’s be bounded J 7 ,-measurable 
random variables, satisfying for each i = 1, 2, • ■ ■, n, 

E \Zi\Ti-f\ = 0, almost surely. 

Then there exists a constant c such that for all t > 0, 

P Zi — E [Zi] >t^j< exp (— cnt 2 ) . 

Proof. This result is a special case of Theorem 2.5 of |59j for 
bounded random variables. □ 


Proposition [3] implies that 

P (|(V£(0)) i | > t) < exp(— cnt 2 ). (18) 

By the union bound, we get: 

P( l|V£(6»)|| oo > t) < exp(—ct 2 n + logp). (19) 

Choosing t = ' 1+ C Q!1 for some o.\ > 0 yields 

P (||V£WIL > < 2exp(-a 1 logp) < 

Hence, a choice of 7 n = with ^2 := \J 1+ e ai satisfies 

(IT4l) with probability at least 1 — —gy. Combined with the 
result of Lemma [I] for n > d\s 2 ^ 3 p 2 ^ 3 logp, we have that 
the RSC is satisfied with a constant n with a probability at 
least 1-£- >1-f- for some constant an. The latter 

p a 2 — n a 2 z 

results in conjunction with Proposition [2] establishes the claim 
of Theorem 1. □ 

Remark. The choice of 7r m i n does not affect the proof of 
Theorem Q] and can be chosen as 0 in defining the set 0, 
thereby relaxing the first inequality in (*). However, as we 
will show below, the assumption of 7r m i n > 0 is required for 
the proof of Theorem [2] 


E[V£(0)] =0. 
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E. Proof of Theorem [2] 

The proof is mainly based on the following proposition, 
adopted from Theorem 2.1 of 01, stating that the greedy 
procedure is successful in obtaining a reasonable s*-sparse 
approximation, if the cost function satisfies the RSC: 

Proposition 4. Suppose that il(0) satisfies RSC with a con¬ 
stant k > 0. Let s* be a constant such that 

4 S 20s 

s* > — -log 2 -= e>(slogs), (20) 

<in K <in" 

Then, we have 

^ V&e s * 

— 1 
2 K, 

where e s * satisfies 


/i( s *) _ n 

^POMP Us 


£s* < \/s* _ -Fs||V£( 0 s )|| oo . (21) 


Proof. The proof is a specialization of the proof of Theorem 
2.1 in lf30l to our setting. □ 


Recall that Lemma Q] establishes the RSC for the negative 
log-likelihood function. In order to complete the proof of 
Theorem H it only remains to upper bound || V£(0 s )||oo. Let 
A i tS := p + 0'x- Zp. We have 


E[V£(0 a )] =E 


n 

- Y, l Xi - + e 's*T- P )\ 


x i_1 

"H-p 




1 

n 


■E E E[(0-e.)'x5=i|xj:i] 

i =1 L 

< C 2 (Js{0) 1 . 


i-p 


A, jS (l A; >s ) 


where we have used the fact that 0 < x, < 1 for all i, 
and C 2 := -rr--• Invoking the result of Proposition 

I - Tfmin (, J- 71”max ) 

[3] together with the union bound yields: 

P^||V£(0 a )| \oo>c 1 ^^ + c 2 a s (0^ < J r . 

for some constants Ci and j3-\ . Hence, we get the following 
concentration result for e s *: 


P ^e s * > Vs*Ts \J~~f + c 2 cr a ( 0 )^ ^ (22) 

Noting that by d20l l we have s* + s = 0(s log s ) < cos log s, 
for some constant Co, and invoking the result of Lemma [7] we 
get: 


q( s *) _ Q 

"POMP a s 


<4 


<4 


s log s logp 


4 s log scr s (0) 


s log s log p , log s 


s« 


4—2 ’ 


where d' 2 = ^/cqCi and d' 3 = ^Jcqc 2 . with proba¬ 

bility (l - exp (- s2 ^f g fy 2p2 ) ) (1 - Jr) ■ Choosing n > 


4s 2 ^ 3 (logs) 2 / 3 p 2 / 3 logp establishes the claimed success 
probability of Theorem [3 Finally, we have: 


o( s *) a 

^POMP u 

2 

n( s *) a 

^POMP 

- ( 

CG 

<£> 

+ - 


< 

fl\ s ) n 

^POMP Us 

2 + H4- 


Using ||0 S — 0||2< 1 7 S (0) = O (s 1 ^ completes the proof. 


F. Proofs of Corollaries [7] an d\2\ 

Proof of Corollary [7] The claim is a direct consequence of the 
boundedness of covariates and can be treated by replacing 6 
with the augmented parameter vector [p, 6'\ and augmenting 
the covariate vectors with an initial component of 1. The reader 
can easily verify that all the proof steps can be repeated in the 
same fashion. □ 

Proof of Corollary [2] The claim is a direct consequence of 
the boundedness of covariates which results in </>(•) being 
Lipschitz and hence the stationarity of the underlying process. 
Moreover, for twice-differentiable 0(-), the proof of Lemmaj]] 
in Appendix[A]can be generalized in a straightforward fashion. 
The reader can easily verify that all the remaining portions 
of the proofs of the main theorems can be repeated for such 
</>(•) in a similar fashion to that of the canonical self-exciting 
process. □ 


Appendix B 

Proofs of Proposition [Hand LemmaQ] 

A. Proof of Proposition [7] 

The canonical self-exciting process can be viewed as a 
Markov chain with states Xi = x*I 3 . Since each x, has two 
possible values, there are 2 P possible states. This Markov chain 
is irreducible since transition from any state to any other state 
is possible in at most p steps. Also, transition from an all-zero 
state to itself is possible. Hence the chain is aperiodic as well. 
This implies that there exists a stationary distribution for the 
Markov chain. We also know that if {Xi}^Z 1 is a stationary 
Markov Chain, then for any functional /(.), {f(Xi)}^2. 1 is 
a strictly stationary stochastic process (SSS). Therefore the 
canonical self-exciting process and the spiking probability 
sequence A" are both SSS. In particular, we have 

7T* := E[xj] = E [E [xj|Aj]] = E[A,] = p + 7r*l'0. 

Hence, the stationary probability 7r* satisfies: 

T 


In order to prove the first two inequalities, we make the 
necessary assumption that the baseline rate p is positive, due 
to the non-degeneracy assumption. In order to highlight the 
necessity of this condition, consider a sample path which 
contains p successive zeros starting from index * + 1 to i + p, 
corresponding to an all-zero covariate vector x*^ (note that 
this sample path will almost surely occur). We then have 
Aj +p+ i = p, + 0'x-tf = //. Therefore, if p is not positive, 
the process becomes degenerate. 
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The third inequality follows from the fact that for a covariate 
vector x'tj 1 with a support matching that of 0 we have 
Ai+p+i = /i + 0 ,x i+i = M + l , 0 + > which should be a 
valid probability. Moreover, the inequality is strict since the 
stationary probability 7r* = jfyjQ must be well-defined. 

We will next calculate the power spectral density of the 
process. Let r 00 ^ and denote the autocorrelation and 

autocovariance values of the process, respectively. By the 
stationarity of the process we have: 


Tk 


= E 
= E 


[x.+kX.] = E [xkx o] = E [E [xfcxolx*:^]] 


i n' k—1 

px o + 0 x k - P x o 


jj7 r* + O' r 


k -1 
k—p * 


The result of the Lemma Q] is equivalent to proving that 


E„ 


{$'X._1) 2 > Kill’ll 


(25) 


holds with probability greater than 1 — 2 exp ^ . Since 

both sides of ( f25l) are quadratic in i/>, the statement is equiv¬ 
alent to proving E„ [(i/i'xlj) 2 ] > k, for all ||t/)|| 2 G § 2 ( 1 ). 
We establish this in two steps: 


Step 1. First, we show that the statement holds for the true 
expectation: 


E 




> Kl > 0 


(26) 


for k > 0. Similarly, by subtracting the means we have the 
following identity for the autocovariance: 




= O'c 


k-l 

k—p’ 


(23) 


A straightforward calculation gives co = 7r* — 7r 2 . Eq. © 
resembles the Yule-Walker equations for an AR process of 
order p with parameter 0 and the innovations variance given 
by (7 2 = jjzrppyZ ■ Thus, the power spectral density of the 
canonical self-exciting process can be expressed as: 


^ = 27T 




( 1 - 1 ' 6 ») 2 | 1 - 0 ( w )| 2 


(24) 


We have 1 — I'O < 1 + ||0||i. Moreover, 


for some m which will be specified below, for all ||'i/’|| 2 € 
§ 2 ( 1 ). To establish the inequality l l26t . we use the following 
result: 

Proposition 5. Let R G R pxp be the pxp covariance matrix 
of a stationary process with power spectral density S(u)), and 
denote its maximum and minimum eigenvalues by A max (p) and 
Amin (p) respectively then A max (p) is increasing in p, A m i n (p) 
is decreasing in p and we have 

Amin ip) i inf S (cj), and A max (p) | sup S(ui). (27) 

“ U1 

Proof. This is a well-known result in stochastic processes. See 
ll60l l for a proof and detailed discussions. □ 


i-©MI= 

1 - ^ e k e~i“ k 

.. +ll ,, || Using Proposition[5j we can lower-bound E 

< 1 + ||0||i= 1 + II" ||i+||0 ||i 

(v<-p) 2 


k 

_ r / .. 1 x 21 .. 



by: 


< 1 + 2(7T max - At) < 1 + 27T n 


E 


(if'x.Jp) = t//R -Ip > A min (p) > inf S(w). 


which implies the lower bound on S(ui). 


□ Next, using Proposition Q] the bound of Eq. (1261 ) follows for 


B. Proof of Lemma [7] 

The proof is inspired by the elegant treatment of Negahban 
et al. |[27l . The major difficulty in the proof lies in the high 
inter-dependence of the covariates and observations. 

Noticing that the negative log-likelihood (|7]> is twice dif¬ 
ferentiable, a second order Taylor expansion of the negative 
log-likelihood ([]} around 6 yields: 

3) £ (^, 0) = £{0 + if,) - £(0) - i//V£(0) 


=-z 

71 * ^ 


w<-\y 


(/i + 0'x‘-p + \-p )) 2 


n ^—* 




(1 -p- - "Wx-l-DY 


^E(^=i) s 


i=1 


for some v £ [0,1], The inequality follows from the fact that 
both 0 and 0 + i/'f satisfy Q, and hence: 


p + 0'x‘_ p + vif’x\_ v < 7 r max < 1, 

1 ~ 17- O'X-lZl - vfj'yZfl 1 < 1 - 7T min < 1. 


Kl ■= 


7T*(1 - Tt*) 
2tt( 1 + 27T max ) 4 ' 


Step 2. We now show that the empirical and the true expec¬ 
tations of ( 'ip'x.Zp ) are close enough to each other. Let 


'£>ip,n E„ 




-E 




and 


:= sup |D 1 /’,„|. 
V’e§2(i) 


The final step in proving Lemma Q] is given by the following 
proposition: 

Proposition 6. We have 


__ m 

T) > — 

n ~ 4 . 


< 2 exp — 




2 2 
s z p z 


for some constant c. 

Proof. The proof is given in Appendix [Cl 


(28) 


□ 


Finally, the statement of Lemma |T| follows from Proposition 
[6] by taking n = Ki/ 4. □ 
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Appendix C 

Proof of Proposition [6] 

In order to establish the concentration inequality of Eq. ( f28l) . 
we need to invoke a result from concentration of dependent 
random variables. We proceed in two steps: 

Step 1. We first establish a geometric property of X> n , namely 
its 0( — )-Lipschitz property with respect to the normalized 
Hamming metric. Recall that the normalized Hamming metric 
between two sequences x" and y" is defined as d(x",y/) = 

First, by evaluating the first order optimality conditions of 
the solution d sp , it can be shown that the error vector ip = 
6 Sp — 6 satisfies the inequality: 

Il^|| 1 <3||^||i+4||0 s .||i, 

with S denoting the support of the best s-term approximation 
to 6 (see for example (271). By the assumption of as{0) = 
0{y/s), we can choose a constant Co such that cr g(0) < co^fs. 
Hence, 


Step 2. Next, we establish the concentration of 55„ around 
zero. Let H = [x*~^, 1] and H = [x*I 2 >0] be two vectors 
(history components) of length p which only differ in their 
last component, and let the mixing coefficient fj+j for j > i 
be defined as: 

Vij = \\p(x^\H) -p(x.]\H)\\ TV , (31) 

with || • || tv denoting the total variation difference of the 
probability measures induced on {0, l} n_ f +1 . Also, let 

Vij — SUp Tjij , and Qn,i •— 1 + + * * ' + 

H,H 

We now invoke Theorem 1.1 of |58| in the form of the 
following proposition: 

Proposition 7. IfT) n is C-Lipschitz and q := maxi <,< n Q n y, 
then 

( —2 nt 2 \ 

P[|S n -E[£»„]|> t] < 2 exp f - \ . 


||-0l|i< A\\ip s \\i+a s (0) < (4 + co)v^||^s||2< (4 + c 0 )a/s (29) 

where we have used the fact that ||i/’s , ||i< \/s||V , s||2< s/s for 
all ip € §2(1)- Therefore for alii S {1,2, • • •, n}, we have: 


Proof. The proof is identical to the beautiful treatment of 
|58l when specializing the underlying function of the variables 
x!_ p+1 to be 0„. □ 


0 < (^'x-_p) 2 < Mi < (4 + c 0 ) 2 s. 


As we showed in Step 1, C = C'sp/n, for some constant 
(30) £, we ^ave 


We first prove the claim for I),y n . To establish the latter, 
we need to prove 


£ ( V -' x *:)) 2 - W-IY 


< CdlxZ_ p+1 , y" p+1 ), 


for some C = 0(— ), or equivalently 


£ (</++ 2 - 


i= 1 


< C ^2 

i=—p+1 


for some C' = O(s). Let us start by setting the values of 
x" p+1 equal to those of y" p+1 and iteratively change Xj to 1— 
Xj for all indices j where x :) f y :l to obtain the configuration 
given by x" p+1 . For each such change (say Xj to 1 — Xj), the 
left hand side changes by at most 


where we have used the fact that each element of the measures 
p(x"|iT) andp(x"|iT) satisfies the assumption © and that the 
size of the state space {0, l} n_ f +1 is given by 2 n_ - J+1 . By the 
assumption (0, we have rjij < p n ~^ + 1 for p := 2 n max < 1. 
Hence, Q n y < for ah i, and q < -by definition. Using 
the result of Proposition |7J we get: 


> E[©„] + y 


< 2 exp 


-n 3 ^(l 

2 C's 2 p 2 ) ' 


(32) 


It only remains to show that the expectation in (32]) can 
be suitably bounded. Note that by a similar concentration 
argument for Q,/^ we have: 


n 


2=1 

< 


E (^' x E) | , - (^4)^=0 

1 

+ 2 Ei^ 


E[3) n ] = E[|SV, n | ]= J ( 


1 - F, 


3+2' 


dt 


u i-3 




< 3||t/>|| 2 < 3(4 +c 0 ) 2 s, < 2exp * ) 


dt = 2 a 


1 C'tt ps 
(1 — p) n 3 / 2 


where we have used the inequality given by Eq. (f30t . Hence, 
the C can be taken as 3(4 + co) 2 sp/n and the claim of the 
proposition for T)y, n follows. A very similar argument can 
be used to extend the claim to 35„. Let if* := t/»*(x™ p+1 ) 
be the ip for which the supremum in the definition of D n 
is achieved (such a choice of ip exists by the Weierstrass 
extreme value theorem). Since ip* also satisfies (1291 ). a similar 
argument shows that T> n is C7(^)-Lipschitz (with possibly 
different constants). 


Thus choosing n > dis 2 ^ 3 p 2 ^ 3 logp, for some positive con¬ 
stant d\, E[2)„] drops as 1/log 3//2 p, and will be smaller than 
ki/A for large enough p. Hence, combined with (32l ) and by 
defining c := we have: 


2>» > - 


< 2 exp 


—cn 3 K 2 \ 
s 2 p 2 ) ’ 


which establishes the claim of Proposition [6] 


□ 
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Appendix D 

Goodness-Of-Fit Tests for Point Process Models 

In this appendix, we will give an overview of the statistical 
tools used to assess the goodness-of-iit of point process 
models. A detailed treatment can be found in ll24l . 

The Time-Rescaling Theorem. Let 0 < t\ < t 2 < ■ • ■ be 
a realization of a continuous point process with conditional 
intensity A(t) > 0, i.e. tk is the first instance at which N(tk) = 
k. Define the transformation 

z k := Z(t k ) = f X(t)dt. (33) 

Jtk-1 

Then, the transformed point process with events occurring at 
t' k = ■_ i z k corresponds to a homogeneous Poisson process 

with rate 1. Equivalently, Zi,Z 2 ,— ■ are i.i.d exponential ran¬ 
dom variables. The latter can be used to construct statistical 
tests for the goodness-of-fit. 

The Komlogorov-Smirnov Test for Homogeneity. Suppose 
that we have obtained the rescaled process through ( fTT[ ) with 
the estimated conditional intensity. When applying the time¬ 
rescaling theorem to the discretized process, if the estimated 
conditional intensity is close to its true value, the rescaled 
process is expected to behave as a homogeneous Poisson 
process with rate 1. The Kolmogorov-Smirnov (KS) test can 
be used to check for the homogeneity of the process. Let 
Zk s be the rescaled times and define the transformed rescaled 
times by the inverse exponential CDF u k '■= 1 — e~ Zk . If the 
true conditional intensity was used to rescale the process, the 
random variables u k must be i.i.d. Uniform(0,1] distributed. 
The KS test plots the empirical qualities of u k ’s versus the true 
quantiles of the uniform density given by b k = , where 

J is the total number of observed spikes. If the conditional 
intensity is well estimated, the resulting curve must lie near the 
45° line. The asymptotic statistics of the KS distribution can be 
used to construct confidence intervals for the test. For instance, 
the 95% and 99% confidence intervals are approximately given 
by ±-^= and hulls around the 45° line, respectively. 

The Autocorrelation Function Test for Independence. In 
order to check for the independence of the resulting rescaled 
intervals z k , the transformation v k = < h _1 (ttfc) is used, where 
( 1> is the standard Normal CDF. If the true conditional intensity 
was used to rescale the process, then v k s would be i.i.d. Gaus¬ 
sian and their uncorrelatedness would imply independence. 
The Autocorrelation Function (ACF) of the variables v k must 
then be close to the discrete delta function. The 95% and 99% 
confidence intervals can be considered using the asymptotic 
statistics of the sample ACF, approximately given by 
and ± 2 '^j , respectively. 

Remark. The binning size used for discretizing the data can 
potentially affect the ISI distribution of the time-rescaled 
process. In order to avoid these issues, we have used the 
empirical ISI distribution estimated from a large realization 
of the process (estimated from the training data) as the null 
hypothesis for both tests (performed on the test data). 
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